NEWTON-BASED OPTIMIZATION FOR NONNEGATIVE TENSOR 

FACTORIZATIONS 

SAMANTHA HANSEN, TODD PLANTENGA, TAMARA G. KOLDA 

Abstract. Tensor factorizations with nonnegative constraints have proved to be powerful tools 

for analysis of sparse count data in areas such as bioinformatics and social network analysis. We 

consider application data best described as being generated by a Poisson process (e.g., count data), 

which leads to sparse tensors that are typically modeled by sparse factor matrices. In this paper we 

r-fN investigate efficient techniques for computing an appropriate tensor factorization model and propose 

_„ new subproblem solvers within the standard alternating block variable approach. Our new methods 

f— ^ exploit structure and reformulate the optimization problem as small independent subproblems. We 

*vj employ bound-constrained Newton and quasi-Newton methods. We compare our algorithms against 

other codes, demonstrating superior speed for high accuracy results and the ability to quickly find 

p^ sparse solutions. 
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1. Introduction. In this paper we present second-order methods for nonnega- 
tive tensor models involving the Kullback-Leibler (K-L) divergence. Multiplicative 
update is one of the most widely implemented methods for tensor factorization, but it 
suffers from slow convergence and an inability to discover the underlying sparsity. By 
reformulating the problem and using second-order information, we are able to propose 
algorithms that are both fast in terms of compute time as well as quick in identifying 

^~i sparsity in the factors. 

f^ Tensor factorization is increasingly popular as a tool for analysis of multidimen- 

sional data sets. A case of particular interest is data encoding the total counts of 
an item or event. Example applications include the number of papers published by 
an author at a given conference in a given year |16) , the number of Internet packets 
sent from one IP address to another at a given port [35] , and the frequency of term 
use in emails for a given user in a given time period [3]. Data in these applications 
is often quite sparse, i.e., most tensor elements have a count of zero. Count data is 

^^ also nonnegative and therefore best modeled by nonnegative factors. In this paper we 

^N model the iV-way tensor X as a sum of multilinear components, i.e., 
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l«M = ^A r a( 1 'o...oa( J '>, (1.1) 



where the operator o is an outer product, af 1 is a nonnegative "factor" whose entries 



sum to 1, and A r > is a scalar weight parameter (section 1.2 provides more details) 



The factorization computes values for parameters A, a\ , . . ., a. R by minimizing an 
appropriate objective function subject to nonnegativity constraints. 

Studies show that sparse multiway count data is best modeled under a Poisson 
assumption on the data |10[ 131 j : in other words, Xi ~ Poisson(mi). A maximum 
likelihood formulation leads to a particular objective function, the Kullback-Leibler 
(K-L) divergence function. In this paper we consider minimizing the K-L objective 



with nonnegativity constraints to produce the factorization (1.1 1 



(i) 

We are particularly interested in the case where the computed factors aj. are 
sparse, i.e., contain relatively few nonzero elements. Tensor results are generally 
easier to interpret when factors are sparse. Moreover, many sparse count applications 
can reasonably expect sparsity in the factors. For example, the publication data from 
[16) counts publications by authors at various conferences over a ten year period. The 
tensor representation has a sparsity of 0.14% (only 0.14% of the data elements are 
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nonzero), and the factors computed by our algorithm with R = 60 have sparsity 9.3%, 



2.7%, and 77.5% over the three modes (see section 4.4 ). One meaning of sparsity in the 
factors is to say that a typical outer product term connects about 9% of the authors 
with 3% of the conferences in 8 of the 10 years. Linking authors with conferences is 
an important outcome of the tensor analysis, requiring clear distinction between zero 
and nonzero elements in the factors. We show that our algorithms are particularly 
good at this task. 

We use a standard Gauss-Seidel alternating block framework but develop a new 
formulation of the optimization problem that breaks it into row subproblems. Each 
row subproblem amounts to minimizing a strictly convex function with nonnegativity 
constraints. The reformulation makes it possible to efficiently use second derivatives 
(or their approximation) and is therefore able to compute highly accurate factor- 
izations. We identify variables at their bounds with a two-metric gradient projection 
technique that is fast and robust. Our software implementation is well suited for large, 
sparse tensor data because it only needs to compute terms resulting from nonzero data 
elements. 

We hnd that computing factor matrices to high accuracy (as measured by satisfac- 
tion of the first-order KKT conditions) is effective in revealing sparsity. By contrast, 
popular multiplicative update methods can make elements small but are slow at reach- 
ing the variable bound at exactly zero, forcing the experimenter to guess when "small" 
means zero. We demonstrate that guessing a threshold is inherently difficult, making 
the high accuracy obtained with our methods desirable. 

Our contributions in this paper are as follows: 

1. A new formulation for nonnegative tensor factorization based on the Kullback- 
Leibler divergence objective, resulting in a set of small, independent row sub- 
problems. Each subproblem contains just R variables, where R is the desired 
number of components in ( |1.1| . 

2. Two Matlab algorithms for computing factorizations of sparse nonnegative 
tensors: one using second derivatives and the other using limited memory 
quasi-Newton approximations. The algorithms are made robust with an 
Armijo line search, damping modifications when the Hessian is ill-conditioned, 
and projection techniques based on two-metric gradient projection ideas in 

m- 

3. Test results that compare the performance of our two new algorithms with the 
best available multiplicative update method and an alternating variable block 
quasi-Newton algorithm that does not formulate using row subproblems. 

4. Test results showing the ability of our methods to quickly and accurately find 
sparse factor matrices without using problem-dependent thresholds. 

1.1. Related Work. Nonnegative matrix factorization (NMF) is best known 
from the work of Lee and Seung [33J OS] ■ The considered both Gaussian and Poisson 
models, leading to least squares and K-L divergence objective functions, respectively. 
In both cases, they used an alternating variable block iterative method, i.e., a non- 
linear block Gauss-Seidel method. Each block subproblem is convex, and a simple 
multiplicative update formula usually leads to an acceptable local minimum. Welling 
and Weber generalized NMF to the CANDECOMP/PARAFAC tensor factorization 
(NTF) [26]. 

In terms of the K-L version, it has been extended to a generalized K-L objective 
[14] as well as a Tucker tensor factorization [35] . Chi and Kolda provided an improved 
multiplicative update scheme for K-L that addressed performance and convergence 
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issues as elements approach zero |10| : we compare to their method in Section Bj By in- 
terpreting the K-L divergence as an alternative Csiszar-Tusnday procedure, Zafeiriou 
and Petrou [3j5] provided a probabilistic interpretation of NTF along with a new mul- 
tiplicative update scheme. The multiplicative update is equivalent to a scaled steepest 
descent step [34] . so it is a first-order optimization method. Since our method uses 
second-order information, it allows for convergence to higher accuracy and a better 
determination of sparsity in the factorization. 

Zdunek and Cichocki [301 UJJ proposed a hybrid method for Blind Source Sep- 
aration applications via NMF that used a damped Hessian method similar to ours. 
They recognized that the Hessian of the K-L objective has a block diagonal structure, 
but did not reformulate the optimization problem further as we do. Consequently, 
their Hessian matrix is large, and in both papers they switch to a least squares objec- 
tive function for the larger mode of the matrix because their Newton method cannot 
scale up. Mixing objective functions in this manner is undesirable because it com- 
bines two different underlying models. Our Hessian reformulation can solve both 
modes efficiently and extends to NTF. As a point of comparison, a mode of size 1000 
is considered too large for their Newton method [41] . but our algorithm can factor a 
200 x 1000 data set to high accuracy in about ten minutes. The Hessian-based method 
in [UJ has most of the advanced optimization features that we use (though details 
differ), including an Armijo line search, active set identification, and an adjustable 
Hessian damping factor. 

A similar reformulation to ours was noted in earlier papers exploring the least 
squares function, but it was not used for Hessian-based methods or to exploit spar- 
sity. Gonzales and Zhang used the reformulation with a multiplicative update method 
for NMF [18] but did not generalize to tensors or the K-L objective. Kim and Park 
used the reformulation for NTF [22] , deriving small bound-constrained least squares 
subproblems. Their method solved the least squares subproblems by exact matrix fac- 
torization, without exploiting sparsity, and featured a block principal pivoting method 
for choosing the active set. Other works that solve the least squares norm objective 
by taking advantage of row by row or column by column subproblem decomposition 
include QIJ [25]. 

Our algorithms are similar in spirit to the work of Dongmin et al. [15] , which ap- 
plied a projected quasi-Newton algorithm (called PQN in this paper) to solving NMF 
with a K-L objective. Like PQN, our algorithms identify active variables, compute 
a Newton-like direction in the space of free variables, and find a new iterate using 
a projected backtracking line search. We differ from PQN in reformulating the sub- 
problem and in computing a damped Newton direction; both improvements make a 
huge difference in performance for large-scale tensor problems. We compare to PQN 
in Section 2] 

There are many other works on the least squares version of NTF [5] [T7] [20] 1211 1221 
[211 [301 H [Ml EZ] • In particular, all-at-once optimization methods, including Hessian- 
based algorithms, have been applied to nonnegative tensor factorization with a least 
squares objective function. Acar et al. considered conjugate gradient and nonlinear 
least squares methods on the full set of variables, though without nonnegativity con- 
straints [TJ . Paatero replaced the nonnegative constraints with a barrier function [29 . 
We are not aware of any work on all-at-once methods for the K-L objective in NTF. 

Finally, we note that all methods, including ours, find only a locally optimal 
solution to the NTF problem. Vavasis proved that finding the global optimal solution 
in NMF isNP-hard [36]. 



1.2. Tensor Review. For a thorough introduction to tensors, see [23] and refer- 
ences therein; we only review concepts that are necessary for understanding this paper. 
A tensor is a multidimensional array. An A- way tensor X has size I\ x I 2 x . . . x Ijq. 
To differentiate between tensors, matrices, vectors, and scalars, we use the following 
notational convention: X is a tensor (bold, capitalized, calligraphic), X is a matrix 
(bold, capitalized), x is a vector (bold, lower-case), and x is a scalar (lower-case). 
Additionally, given a matrix X, x^ denotes its j column and x^ denotes its i th row. 

Just as a matrix can be decomposed into a sum of outer products between two 
vectors, an A-way tensor can be decomposed into a sum of outer products between 
N vectors. Each of these outer products (called components) yields an A-way tensor 
of rank one. The CP (CANDECOMP/PARAFAC) decomposition (or Kruskal form) 
[HI [H] represents a tensor as a sum of rank one tensors (see Figure |1.1| : 



X = 



A;AW AW|=^A, 



R 



aWo. 



o a 



(N) 



(1.2) 



where A is a vector, and each A^ n ' is an /„ x R factor matrix containing the R vectors 
contributed by mode n to the outer products; that is, A*-"-* = [a-j™ • • • a^ ]. Equality 



holds in ( 1.2 ) when R equals the rank of X, but often a tensor is approximated by a 
smaller number of terms. We let i denote the multi-index («i, 12, . . . , i^) of an element 
Xi of X. 
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Fig. 1.1: CANDECOMP/PARAFAC decomposition of a 3- way tensor into R compo- 
nents. 



We use of the idea of matricization, or unfolding a tensor into a matrix. Specifi- 
cally, unfolding along mode n yields a matrix of size I n ■ J n , where 

Jn = I\ ' h ' • ■ • ' In-l ' In+l • •••• In- 

We use the notation X( n \ to represent a tensor X that has been unfolded into its n th 



„(«) 



mode, and x\, for its (i,j) element. If a tensor X is written in Kruskal form (1.2), 
then the n th matricized mode is given by 

X (n) = aWa(A^ ... A^" +1 ) A^ ... A^)) T 

where A = diag(A) and denotes the Khatri-Rao product |23j . 

2. Poisson Nonnegative Tensor Factorization. In this section we state the 
optimization problem, examine its structure, and reformulate it into simpler subprob- 
lems. 



We seek a tensor model in CP form to approximate data X: 

R 



XwM 



A;A«,...,AW =£A r «4 1 >o...oaC- 



(N) 
■\ r ** r ~ . . . « ** 



The value of i? is chosen empirically, and the scaling vector A and factor matrices 
A^™' are the model parameters. 

In |10j . it is shown that a K-L divergence objective function results by assuming 
that data elements follow Poisson distributions with multilinear parameters. The 
best-fitting tensor model under this assumption satisfies: 

min fCM.) = > m\ — Xilogm; 

X,A«,...,A( W ) 4^ 

i 

H 

s.t. M = ^A r a[ 1) o...oa( J '), (2.1) 

A r >0,4 n) >0,||a^|| 1 =l ) Vre{l,...i?}, Vn € {1, . ..N}. 

The model may have indices where m\ = and Xi = 0; for this case we define 
OlogO = 0. Note that for matrix factorization, (2.1) reduces to the K-L divergence 



proposed by Lee and Seung [331 [34], an d the tensor form (2.1) is the same as that in 
[39]. 

The constraint that normalizes the column sum of the factor matrices serves to 
remove an inherent scaling ambiguity in the CP factor model. We chose the l\ norm 
for computational ease and to promote sparsity of model elements. 

As in |10j . we unfold X and M into their n th matricized mode, and express the 
objective as 

/(M) = e T [A (n) An (n) - X (n) * log(AWAn("))]e, 

where e is a vector of all ones, 

n W = (A (JV) 0...0A ( " +1) 0A ( "- 1) 0...0A (1) ) T GR i?XJ ", and (2.2) 
A = diag(A) € R RxR . 

The operator * denotes element- wise multiplication, and log(-) is taken element-wise. 
Note that by expanding the Khatri-Rao products in (2.2) and remembering that 
column vectors af 1 are normalized, each row of IT"^ conveniently sums to 1. This is 
a consequence of using the i\ norm in (2.1). 

The above representation of the objective motivates the use of an alternating block 
optimization method where only one factor matrix is optimized at a time. Holding 
the other factor matrices fixed, the optimization problem for A^ and A is 

min /(A, A (n) ) = e T [A ( " ) An (rl) - X (n) * log(A (n) An (n) )]e 
A,Af») (2.3) 

s.t. A > 0, A (n) > 0, e T A (n) = 1. 



Problem (2.3) is not convex. However, ignoring the equality constraint and letting 
B («) = A (n) A, we have 

min /(B (n) ) = e T [B (n) n (n) - X M * log(B ( "»n ( ">)]e (2.4) 

B<' 1 )>0 



which is convex with respect to B^™'. Chi and Kolda show in [10] that (2.4 1 is strictly 
convex given certain assumptions on the sparsity pattern of X( n \ . 

At this point we can define Algorithm [TJ a Gauss-Seidel alternating block method. 
The algorithm iterates over each mode of the tensor, solving a convex optimization 
subproblem. Steps [6] and [7] rescale the factor matrix columns, redistributing the 
weight into A. For the moment, we leave the subproblem solution method in step [5] 
unspecified. A proof that Algorithm [T] convergences to a local minimum of ( |2.1[ ) is 
given in [TO] , 



Algorithm 1 Alternating Block Framework 

Given tensor sizes I\, . . . In, the number of components R, and data tensor X 

Return a model M = [A; A (1) . . . A {N) ] 

1: Initialize A (n) e R 7 " x « for n = 1, . . . , N 

2 
3 

1 



repeat 

for n = 1 , . . . , N do 

Let n« = (A<*> . . . A^ A^" 1 ) ... A«) T 
5: Use Algorithm [2] to compute B* that minimizes /(B (n) ) s.t. B (rl) > 

6: A <- e T B* 

A (») ^_ b*A~\ where A = diag(A) 
end for 
until all mode subproblcms have converged 



This outline of Algorithm [TJ corresponds exactly with the method proposed in 
[TU] ; where we differ is in how to solve subproblem (2.4) in step [5] Note also that 



this algorithm is the same as for the least squares objective (references were given 
in section |1.1[) ; there the subproblem in step p^ is replaced by a linear least squares 



subproblem. We now proceed to describe our method for solving (2.4 1 



2.1. Subproblem Reformulation. We examine the objective function /(B 



("h 



in (2.4) and show that it can be reformulated into independent subproblcms. As men- 
tioned in the previous section, rows of IP™-' sum to one if the columns of factor matrices 
are nonnegative and sum to one. When IT"-* is formed at step|4Jof Algorithm |l] the 
factor matrices satisfy these conditions by virtue of steps [6] and [7j hence, the first 
term of /(B (n) ) is 



(n) 



e T B (n) n (n) e = e T B („) e = g^6 



The second term of f(B^ n ') is a sum of elements from the /„ x J„ matrix X( n ) * 
logf^B^rr™- 1 ). Recall that the operations in this expression are element-wise, so the 
scalar matrix element (i,j) of the term can be written as 



\r=l / 
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Adding all the elements and combining with the first term gives 

/(B (n) ) = E t i n) - E E 4 n) lo § f E t ] 4 

z—l r— 1 i— 1 j — 1 \r— 1 



EAow^r,^,^-)). 



1=1 
where b^ and jq are the i th row vectors of their corresponding matrices, and 

R J n / R \ 

/ row (b, x, II) = E &r - E &o l °S I E " hrlTr i J ' ( 2 ' 5 ) 

r— 1 j — 1 \r— 1 / 



Problem (2.4) can now be rewritten as 

. min £/ rw (b i (n) ,4< n) ,IlC">). (2.6) 

6i, ...,£/„ >o ^ 

This is a completely separable set of I n row subproblems, each one a convex non- 
linear optimization problem containing R variables. The relatively small number of 
variables makes second-order optimization methods tractable, and that is the direc- 
tion we pursue in this paper. Algorithm [2] describes how the reformulation fits into 
Algorithm [l] 



Algorithm 2 Row Subproblem Framework for Solving (2.6) 



Given X( n ) of size /„ x J n , and IT™- 1 of size R x J„ 
Return a solution B consisting of row vectors b 1; . . . , b /jv 
for i — 1, . . . ,I n do 
Select row x, of X( n ) 

Generate one column of IT™** for each nonzero in x, 
Use Algorithm ^ or kl to compute b^ that solves 



min / raw (6 1 W ,i| n) ,n (n) ) subject to b, > 



5: end for 



St ep |3 | of Algorithm [2| exploits sparsity in X(„) to reduce computational costs. 
From ( J2.5[ ) we see that columns of IT™' can be ignored unless they correspond to a 
nonzero clement of Xj ; this point was first published in [10] . 

Algorithm [2] points the way to a parallel implementation of the CP tensor factor- 
ization. We note that each row subproblem can be run in parallel, and storage costs 
are determined by the sparsity of the data. For instance, in a distributed computing 
architecture an algorithm could identify the nonzero elements of each row subproblem 
at the beginning of execution and collect only the data needed to form appropriate 
columns of IT"' at a given processing element. We do not implement a parallel version 
of the algorithm in this paper. 

3. Solving the Row Subproblem. In this section we show how to solve the 



row subproblem (2.6) using second-order information. We describe two algorithms, 
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one applying second derivatives in the form of a damped Hessian matrix, and the 
other using a quasi-Newton approximation of the Hessian. Both algorithms project a 
step to the bound constraints, but the details differ. 

Each row subproblem is a strictly convex function of R variables with nonnega- 
tivity constraints. One of the most effective methods for solving bound constrained 
problems is second-order gradient projection; cf. |32] . We employ a form of two-metric 
gradient projection from Bertsekas ^7 . Each variable is marked in one of three states 
based on its gradient and current location: fixed at its bound (zero), allowed to move 
in the direction of steepest descent, or free to move along a Newton or quasi-Newton 



search direction (details are in section 3.1| 



An alternative to Bertsekas is to use methods that employ gradient projection 
searches to determine the active variables (those set to zero). Examples include the 
generalized Cauchy point [12] and gradient projection along the steepest descent di- 
rection with a line search. We experimented with using the generalized Cauchy point 
to determine the active variables, but preliminary results indicated that this approach 
sets too many variables to be at their bound, leading to poor overall performance. 
Gradient projection steps with a line search calls for an extra function evaluation, 
which is undesirable due to its computational cost. Given a more efficient method for 
evaluating the function, this might be a better approach since, under mild conditions, 
gradient projection methods find the active set in a finite number of iterations [6]. 

For notational convenience, in this section we use b for the column vector repre- 

-t , 

sentation of row vector b^; that is, b = b^ . Iterations are denoted with superscript 

k, and V r represents the derivative with respect to the r th variable. Let -P+[v] be 

the projection operator that restricts each element of vector v to be nonnegative. We 

make use of the first and second derivatives of / r0 wi given by 

v r / row (b) = g /™(^*' n ) = i - f; -gg , (3.i) 

V7 2 f (h\ — /low(P,X, II) _ ^-\ XjTT r jTT s j 

°°r dbs 3=1 (Ei=i h ^a) 

3.1. Two-Metric Projection. At each iteration k we must decide which vari- 
ables are at their lower bound value of zero. It is important to choose this set so 
that progress is made on decreasing the objective. Bertsekas demonstrated in [7J that 
iterative updates of the form 

b fe+1 = P + [b fe - aM fc V/ row (b & )] 

are not guaranteed to decrease the objective function unless M is a positive diagonal 
matrix. Instead, it is necessary to predict the variables that have the potential to 
make progress in decreasing the objective, and then update just those variables using 
a positive definite matrix (in our case, second-order information). We present the 
two-metric technique of Bertsekas as it is executed in our algorithm, which differs 
superficially from the presentation in [7J. 

A variable's potential effect on the objective is determined by how close it is to 
zero and by its direction of steepest descent. If a variable is close to zero and its 
steepest descent direction points towards the negative orthant, then the next update 
will most likely project the variable to zero and its small displacement will have little 



effect on the objective. A closeness threshold e k is computed from a user defined 
parameter e > as 



e k = min(w fe , e), w k = b k P+[b k - V/ row (b fc )] 
We then define index sets 

.4(b fc ) = {r | b k = 0, V r / row (b fc ) > 0} , 
G(h k ) = {r\0<b k < e k) V r / row (b fc ) > 0} , 
T(b k ) = („4(b fc ) U G(b k ) 



(3.2) 



(3.3) 



Variables in the set A are fixed at zero, variables in Q move in the direction of the 
negative gradient, and variables in J- are free to move according to second-order 
information. Note that if e = then e k — 0, Q is empty, and the method reduces to 
defining an active set of variables by instantaneous line search [3]; we use this in the 



quasi-Newton method described in section 3.2 



3.1.1. Damped Newton Step. The damped Newton direction is taken with 



respect to only the variables in the set T from (3.3 1. Let 



g£ = [V/ row (b fc )b, H* = [V 2 /row(b fc )]^, b k F = [b k } r , 

where [v]jr chooses the elements of vector v corresponding to variables in the set T. 
Since the row subproblems are strictly convex, the full Hessian and H F are positive 
definite. 

The damped Hessian has its roots in trust region methods. At every iteration we 
form a quadratic approximation m k of the objective plus a quadratic penalty. The 
penalty serves to ensure that the next iterate does not move too far away from the 
current iterate, which is important when the Hessian is ill-conditioned. The quadratic 
model plus penalty expanded about b for variables dp <E W^' is 



T„k 



1 



2 



If\\1 



(3.4) 



m k (d F ;fi k ) = / row (b K ) + d^g£ + -d^H£d F 
The unique minimum of m k (-) is 

d F = -(n F + » k i)- 1 SF , 

where H F + /i k I is known as the damped Hessian. Adding a multiple of the identity 
to H F increases each of its eigenvalues by /i k , which has the effect of diminishing the 
length of d F , similar to the action of a trust region. The step d F is computed using 
a Cholesky factorization of the damped Hessian, and the full space search direction 
R is then given by 



d fc e 



d k = 



(4)r 


r e T 


-V r / row (b fc ) 


reG 





r e A, 



(3.5) 



where index sets A, Q, and J- are defined in (3.3) 
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The damping parameter fik is adjusted by a Levenberg-Marquardt strategy |28j . 
First define the ratio of actual reduction over predicted reduction, 

/ row (b fc +d fc )-/ row (b fc ) 
m fe (d^;0)-m fc (0;0) 



where m&(-) is defined by (3.4). Note the numerator of (3.6 1 calculates / row using all 
variables, while the denominator calculates mfc(-) using only the variable in J- . The 
damping parameter is updated by the following rule 

\ip<\ 
/'/. + ! { I in if p> | (3.7) 

otherwise. 



ih 



Since d F is the minimum of (3.4), the denominator of (3.6) is always negative. If 



the search direction d increases the objective function, then the numerator of p 
will be positive; hence p < and the damping parameter will be increased for the 
next iteration. On the other hand, if the search direction d decreases the objective 
function, then the numerator will be negative; hence p > and the relative sizes of the 
actual reduction and predicted reduction will determine how the damping parameter 
is adjusted. 

3.1.2. Line Search. After computing the search direction d : , we ensure the 
next iterate decreases the objective by using a projected back-tracking line search 
that satisfies the Armijo condition [28]. Given scalars < j3 and a < 1, we find the 
smallest integer t that satisfies the inequality 

/ row (P+[b fc + /3*d fc ]) - / row (b fe ) < a(P + [b k + p& k ] - b fe ) r V/ row (b fe ). (3.8) 

We set a,t = /3* and the next iterate is given by 

b k+1 = P+[b k + a k d k }. 

3.2. Projected Quasi-Newton Step. As an alternative to the damped Hes- 
sian step, we adapt the projected quasi- Newton step from [15] . Their work employs a 
limited memory BFGS (L-BFGS) approximation [27] in a framework suitable for any 
convex, bound constrained problem. 

L-BFGS estimates Hessian properties based on the most recent M update pairs 
{s 1 , y l } , i G [ max{ 1 , k — M} , k } , where 

s k = b fe+i _ b fe ; y k = v/ row (b fe+1 ) - V/ row (b fc ). 

L-BFGS uses a two-loop recursion through the stored pairs to efficiently compute a 
vector p fe = H g* , where H approximates the inverse of the Hessian H ' using the 
pairs {s*,y 4 }. Storage is set to M = 3 pairs in all experiments. See Chapter 7 of [2"8] 
for further detail. 



The projected quasi-Newton search direction d , analogous to (3.5), is 



4= H reJ ; 0.9) 
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where T and A are determined from (3.3| with e = in (3.2). Although e = forfeits 



the guarantee of convergence in solving the row subproblem (see section 3.1 1, we argue 



that it improves the accuracy of the computed step d in (3.9) 



The inverse Hessian approximation H is built from update pairs that span the 
full space of variables. This is necessary because the subspace of free variables can 



change at every iteration. Ideally, (3.9) should compute the step for variables in T 
usi ng o nly second-order information pertaining to those varia bles, as is done with d F 



in (3.5). Instead, p is computed over all variables and then (3.9) zeros out the step 



for variables not in J- . 

We can express this difference in terms of the reduced Hessian. Let B denote the 
true Hessian matrix. Suppose the variables in J- are the first 1, . . . , |J-"| variables, and 
the remaining variables are in Af — A U Q . Write B in block form as 



Bff 

BjVF 



NF 

NN 



( l^l, and B NN G rW*W, i n (3.5 ) we use the 



with B FF G Rl^l*!- 77 !, B NF G 
reduced inverse Hessian; that is, H F = B^. To get (3.9) we form the inverse of 
the Hessian over all variables and take just the rows and columns corresponding to 
variables in J 7 , expressed as [B~ ]jf. Assuming the Schur complement exists, 



[B 



\f 



(B 



FF 



B n pB N N B ' N F ) , 



a matrix of rank 



and we see that this differs from H F by the term B NF B NN Bj^ F , 

\JV\. Hence, if \J\f\ is small then the difference should be less, which is precisely the 



effect we get by using e = in (3.2 ) 



3.3. Stopping Criterion. Since the row subproblems are convex, any point 
satisfying the first-order KKT conditions is an optimal solution. Specifically, b* is a 



KKT point of (2.6) if it satisfies 



V/ row (b*) - v* = 0, (b*) T v* =0, b* > 0, v* > 0, 

where v* is the vector of dual variables associated with the nonnegativity constraints. 
Knowing the algorithm keeps all iterates b nonnegative, we can express the KKT 
condition for component r as 



min{^,V r / row (b fe )} 



0. 



A suitable stopping criterion is to approximately satisfy the KKT conditions to a 
tolerance r > 0. We achieve this by requiring that all row subproblems satisfy 



kkt viol = sqrt I ^[min{&;?, V r / row (b fe )}] 



< r. 



(3.10) 



The full algorithm solves to an overall tolerance r when the kkt V ioi of every row 
subproblem satisfies (3.10). This condition is enforced for all the row subproblems 
(step El of Algorithm 121) generated from all the tensor modes (step [5] of Algorithm fll). 
Note that enforcement requires examination of kkt v ; i for all row subproblems when- 
ever the solution of any subproblem mode is updated, because the solution modifies 
the IT™-* matrices of other modes. 
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Algorithm 3 Projected Newton-Based Solver PDN-R for the Row Subproblem 

Given data x and II, constants fi , v, ft, K max , stop tolerance r, and initial values b° 
Return a solution b* to step [J] of Algorithm [2J 
l: for k = 0, 1, ...,K max do 

Compute the gradient, g fe = V/ row (b ), using x and II in (3.1 1 



Compute the first-order KKT violation 



KKtvi 



viol 



= sqrt y^[min{^ 



9r}f 



> Converged to tolerance. 



if kkt v i i < t then 
return b* = b k 
end if 

Find the indices of free variables from (3.3| with e = 10~ 3 in (3.2) 
Calculate the Hessian for free variables 



H 



[V 2 /row(b fc )]^ 



9 
10 

11 
12 



Compute the damped Newton direction d F = — (H F + fi^I) g F 
Construct search direction d over all variables using d F and g fc in (3.5) 



Perform the projected line search (3.8) using a and f3 to find step length a^ 
Update the current iterate 



b fc+1 = P + [b* 



a k d k ] 



Update the damping parameter /ik+i according to (3.6)-(3.7) 
end for 



return b* 



> Iteration limit reached. 



3.4. Row Subproblem Algorithms. Having described the ingredients, we pull 
everything together into complete algorithms for solving the row subproblem in step HI 
of Algorithm [2| We present two methods in Algorithm [3] and Algorithm |4j PDN-R 
uses a damped Hessian matrix within a two-metric projection framework, and PQN-R 
uses a quasi-Newton Hessian approximation with instantaneous line search. 
As mentioned, PQN-R is related to [T5]. Specifically, we note 

The free variables chosen in step [7] of PQN-R uses e = in (3.2) instead of 

e = 1CU 3 . The choice is unchanged from [15] . 

The line search in step|9]of PQN-R is the same as the Armijo search in PDN- 

R. This differs from [TS], which used aa(d ) T V/ row (b ) on the right-hand 



1. 



side of (3.8). We use rt3.8| because it correctly measures predicted progress. 



l k \T\ 



In particular, it is easier to satisfy when (d ) V/ row (b ) is large and many 
variables hit their bound for small a. 
3. Updates to the L-BFGS approximation in step [Tl] of PQN-R are unchanged 
from [15| . Information is included from all row subproblem variables, whether 
active or free. 

4. Experiments. This section characterizes the performance of our algorithms, 
comparing them with multiplicative update [TD] and a PQN method [TS]. All four 
algorithms fit in the alternating block framework of Algorithm [Tl differing in how they 
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Algorithm 4 Projected Quasi-Newton Solver PQN-R for the Row Subproblem 

Given data x and II, constants fi , v, ft, K max , stop tolerance r, and initial values b° 
Return a solution b* to step [J] of Algorithm [2J 

l: for k = 0, 1, ...,K max do 

2 

3 



Compute the gradient, g k = V/ row (b ), using x and II in (3.1 1 
Compute the first-order KKT violation 



kkt viol = sqrt ^[ 



l{b k r,g k r}f 



if kkt v i i < T then 

return b* = b > Converged to tolerance. 

end if 



Find the indices of free variables from (3.3 ) wi th e = in (3.2 1 



Construct search direction d using g in (3.9) 



Perform the projected line search (3.8) using a and /3 to find step length a^ 
Update the current iterate 

b k+1 = P+[b k + a k d k ] 

Update the L-BFGS approximation with b fc+1 and g fc+1 
end for 
return b* = b > Iteration limit reached. 



solve (2.4) 



Our two algorithms are the projected damped Hessian method (PDN-R), and 
the projected quasi-Newton method (PQN-R), from the pseudo-code of Algorithm [3] 
and [4] Rather than tune the algorithms to each tensor data set, we chose a single 
set of values: fip = 1(U 5 , a = 10~ 4 , and (3 = 1/4. The bound constraint threshold 



in PDN-R from (|3_2] was set to e = 1(T 3 . The L-BFGS approximations in PQN-R 
stored 3 pairs of updates. 

The multiplicative update algorithm that we compare with is that of Chi and 
Kolda [TO], available as function cp_apr in the Matlab Tensor Toolbox [5], and called 
MU in this section. It builds on tensor generalizations of the Lee and Seung method, 
specifically treating inadmissible zeros (their term for factor elements that are active 
but close to zero) to improve the convergence rate. Algorithm MU can be tuned by 
selecting the number of inner iterations for approximately solving the subproblem at 
step [5] of Algorithm [T] We found that 10 inner iterations worked well in the first 
experiment and used the same value in all experiments. 

We also compare to a projected quasi-Newton algorithm adopted from Dongmin 



et al. [J5], called PQN in this section. PQN is similar to PQN-R but solves (2.4 1 



without reformulating it into row subproblems. Like PQN-R, PQN identifies the 



active set using e = in (3.2), and maintains a limited memory BFGS approximation 
of the Hessian. However, PQN uses one L-BFGS matrix for the entire subproblem, 
storing 3 pairs of updates. We used Matlab code from the authors of [15], embedding 
it in the alternating framework of Algorithm [l] with the modifications described in 
section 13.21 

Results show that our algorithms have a faster convergence rate than MU updates 
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in making the kkt V j i small, as expected for Hessian-based methods. Moreover, the 
reformulation into row subproblems (PQN-R) is much faster than PQN. Our methods 
scale well, are robust when faced with degenerate data, and are the fastest algorithms 
if accurate solutions are required. We also find that our algorithms more quickly 
reach good solutions with high sparsity, a desirable feature when the factor matrices 
are expected to be sparse. 

All algorithms were coded in Matlab using the sparse tensor objects of the Tensor 
Toolbox [5]. All experiments were performed on a Linux workstation with 12GB 
memory. Data sets were large enough to be demanding but small enough to fit in 
machine memory; hence, performance results are not biased by disk access issues. 

4.1. Solving the subproblem with PDN-R. We begin by examining algo- 



rithm performance on the convex subproblem ( 2.4 ) of the alternating block framework. 
As algorithms iterate through alternating modes of the tensor, subproblems of differ- 
ent shapes are solved; here we look at a single representative subproblem. Our goal 
is to characterize the relative behavior of algorithms on the representative convex 
subproblem. 

Appendix [A] describes our method for generating synthetic test problems with 
reasonable sparsity. We investigate a 3-way tensor of size 200 x 300 x 400, generating 
S = 500, 000 data samples. The number of components, R, is varied over the set 
{20, 40, 60, 80, 100}. For each value of R, the procedure generates a multilinear m odel 



4.1 



M = [A; A' 1 ', . . . , A^J with sparse factor matrices and data tensor X. Table 
lists the number of nonzero elements found in the data tensor % that results from 
500, 000 data samples 



Table 4.1: 


Subproblem 


sparsity 


for number of cor 




R 


Number Nonzeros 


Density 




20 




408,053 




1.70% 




40 




450,217 




1.88% 




60 




465,086 




1.94% 




80 




471,079 




1.96% 




100 




475,025 




1.98% 



We consider just the subproblem obtained by unfolding along mode 1; hence, 



the test case will contain 200 row subproblems of the form (2.6). We want to run 
several trials of the subproblem solver from different initial guesses of the unknowns, 
holding A^ ' and A^ ' from M constant. The initial guess draws each element of 
A^ 1 ' from a uniform dist ribution on [0, 1) and sets each element of A to one. To 



satisfy constraints in (2.1 1, the columns of A^ ' are normalized to sum to 1, with the 



normalization factor absorbed into A. The mode 1 subproblem (2.4) is now defined 
with II = (A (3) A (2) ) T , X = X (1) , and B = A (1) A, with unknowns B initialized 
to a particular trial value as the starting guess in any algorithm. 

We first characterize the behavior of our Newton-based algorithm, PDN-R. To 
solve just the mode 1 subproblem, the for loop at step [3] of Algorithm [T] is changed 
to n = I. Row subproblems are solved using Algorithm [3] with r = 10 -8 and the 
parameter values mentioned at the beginning of this section. The value of K max 
in Algorithm [3] is large enough that the kkt v ; i converges to r before K n 
reached. 
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10 20 30 
iterations 

(a) R = 20 



10 20 30 
iterations 

(b) R = 60 



10 20 30 
iterations 

(c) R = 100 



200 F 



200 



200 F 




10 20 30 
iterations 

(f) R = 100 

Fig. 4.1: Convergence behavior of Algorithm (PDN-R) over 10 runs with different 
start points, for three values of R. The upper graphs plot log(kkt vio i), showing how 
the maximum violation over all row subproblems varies as the number of iterations 
increases. The lower graphs plot the number of rows violating the KKT-based stop 
tolerance. 




10 20 30 40 

iterations 

(a) Cumulative execution 
time for R = 100 



10 20 30 40 
iterations 

(b) Number of zero entries in 
A* 1 ' for R = 100 



Fig. 4.2: Additional convergence behavior of Algorithm^ (PDN-R) over 10 runs of 
with different starting points. 
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The upper graphs of Figure p4~T| show how KKT violations decrease with iteration 
for three different values of R. The subproblem was solved 10 times from different 
randomly chosen start points. (Since the subproblem is strictly convex, there is a 
single unique minimum that is reached from every start point.) Each line plots the 
maximum violation over all 200 row subproblems for one of the 10 runs. The figure 
demonstrates that after some initial slow progress, row subproblems exhibit the fast 
quadratic convergence rate typical of Newton methods. 

The lower graphs of Figure |4.1| show how many row subproblems satisfy the 
KKT-based stop tolerance after a given number of iterations. Remember that all row 
subproblems must satisfy the KKT tolerance before the algorithm declares a solution. 

Figure |4~2| shows additional features of the convergence, for just the case of R = 
100 components (behavior is similar for other values of R). In (a) we see that execution 
time levels off as more row subproblems satisfy the convergence tolerance. In (b) we 
see the number of elements of A^ ' exactly equal to zero. The data for this experiment 
was generated from a sparse K-tensor plus noise (see Appendix ffik; hence, we expect 
a sparse solution. The plot indicates that sparsity can be achieved after reducing 
kktvioi to a moderately small tolerance (around 10 2 in this example). We will return 
to sparsity of the solution in the sections below. 

4.2. Row Subproblem Formulation Compared with PQN. Next we com- 
pare the PQN algorithm [TS] with our method, PQN-R, demonstrating the huge 
speedup achieved with our row subproblem formulation. We compare the algorithms 
on the sam e sub problem as section |4.1| from the same 10 random initial guesses for 
A (1) . Table ^ 
the kktvioi 



4.2 



lists the average CPU times over 10 runs. PQN-R was executed until 
was less than 10 -8 . PQN was unable to achieve this level of accuracy, so 
execution was stopped at a tolerance of 10 -3 . Results in the table show that PQN-R 
is much faster at decreasing the KKT violation. We note that a KKT violation of 
10~ 8 is approximately the square root of machine epsilon, the smallest practical value 
that can be attained. 



Table 4.2: Convergence comparison in solving the subproblem 







Algorithm 


PQN 




Algorithm 


PQN-R 


R 


kkt v i i < 10 _1 


kkt v i i < 10" 3 


kktvioi < 10 _1 


kktvioi < 10~ 8 


20 




625 sees 




690 sees 




12.4 sees 




17.1 sees 


40 




755 sees 




846 sees 




10.9 sees 




16.4 sees 


60 




822 sees 




920 sees 




11.3 sees 




16.8 sees 


80 




1022 sees 




1141 sees 




13.7 sees 




19.5 sees 


100 




993 sees 




1125 sees 




13.1 sees 




20.2 sees 



The two algorithms also differ in how they discover the number of elements in 
A^ ' equal to zero. Both eventually agree on the number of zero elements, but PQN-R 
is much faster. Figure [4~3] shows progress made by the two algorithms. The behavior 
of PQN for this quantity is erratic and slow to converge (PQN runs eventually identify 
the correct number of zero elements). We believe algorithm PQN-R converges to the 
correct number more quickly because it computes a projection of variables to the 
bounds of zero for each row subproblem. 

Algorithm PQN should be relatively more competitive for tensor subproblems 
with a small number of rows, but many applications are of the size we consider in 
at least one tensor mode. We see no strong advantages to using PQN and do not 
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Fig. 4.3: Number of elements equal to zero in A^ 1 ' found by PQN-R (solid lines, 
forming a short segment in the upper left) and PQN (dashed lines) as a function of 
compute time, for a subproblem with different values of R. The same subproblem 
was solved from six different start points, corresponding to the different colors. The 
red horizontal line shows the true value. 



consider it further. 

4.3. Algorithms PDN-R, PQN-R, and MU on the Subproblem. Next we 

compare our new row-based algorithms, PDN-R and PQN-R, with the multiplicative 
update method MU [TO]. Again we use the subproblem of section 4.1 from the same 
10 random initial guesses for A^ '. 

As described in section |4j MU is a state of the art representative of the most 
common algorithm for nonnegative tensor factorization. It is a form of scaled steepest 
descent with bound constraints .34! , and therefore is expected to converge more slowly 
than Newton or quasi-Newton methods. We see this clearly in Table |4.3| for two 
different stop tolerances. The MU algorithm was executed with a time limit of 1800 
seconds per problem, and failed to reach kkt V i i < 10~ 3 before this limit when R was 
60 or larger. 



Table 4.3: Time to reach stop tolerance for 3 algorithms (averaged over 10 runs) 





kkt v i i = 10 


2 


kkt v i i = 10" 


-3 


R 


PDN-R 


PQN-R 


MU 


PDN-R 


PQN-R 


MU 


20 


8.1 sees 


14.5 sees 


97.7 sees 


8.1 sees 


15.6 sees 


161.3 sees 


40 


25.1 sees 


13.1 sees 


239.2 sees 


25.2 sees 


14.6 sees 


485.9 sees 


60 


53.6 sees 


13.8 sees 


469.2 sees 


53.7 sees 


15.6 sees 


>1800 sees 


80 


92.8 sees 


16.3 sees 


455.4 sees 


92.9 sees 


18.1 sees 


>1800 sees 


100 


139.8 sees 


16.0 sees 


730.7 sees 


140.0 sees 


18.3 sees 


>1800 sees 



Of course, the disparity in convergence time is more pronounced when a smaller 



KKT error is demanded. Figure 4.4 shows the decrease in KKT violation as a function 
of compute time. Here we see that MU makes a faster initial reduction in KKT 
violation than PDN-R or PQN-R, but then it slows to a linear rate of convergence. 
Notice the gap from time zero for PDN-R and PQN-R, which reflects setup cost before 
the first iteration result is computed. For PQN-R the setup time is fairly constant 
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with R (about 3.8 seconds), while PDN-R has a setup time that increases with R (11.5 
seconds for R = 100). Unlike MU, both algorithms must construct software structures 
for all row subproblems before a first iteration result appears. Figure |4~4| also reveals 
that PDN-R is slower relative to PQN-R as the number of factors R increases. This 
is because the cost of solving a Newton-based Hessian is 0(R 2 ), while the limited 
memory BFGS Hessian cost is O(R). 




5 10 15 
time (seconds) 

(a) R = 20 



20 40 

time (seconds) 

(b) R = 60 



50 100 

time (seconds) 

(c) R = 100 



Fig. 4.4: Convergence behavior comparison on subproblem for different values of R (10 
runs each). Algorithm MU (blue) makes fast initial progress in reducing the violation, 
but slows dramatically after reaching a violation of about 0.1. PDN-R (black) and 
PQN-R (green) reduce the violation much further, with PQN-R being faster for higher 
values of R. 
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Fig. 4.5: Effectiveness of the 3 algorithms in finding a sparse solution for different 
values of R. In each case the number of elements in A^ 1 ' equal to zero is plotted 
against execution time. The PDN-R (black) and PQN-R (green) algorithms are much 
faster than MU (blue) at finding the zeros. The horizontal red line shows the true 
value. 



Figure |4.4| indicates that algorithm MU is preferred if a relatively large kkt v ioi is 
acceptable. We argue that this is not acceptable if the goal is to find a sparse solution. 
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Fig. 4.6: Comparison of PDN-R, PQN-R and MU in finding elements of A*- 1 ^ equal 
to zero for a sample run. MU rarely finds exact zeros; therefore, we show results of 
applying various thresholds. Some experimentation may be needed to find the best 
threshold; regardless, it is slower than the methods proposed here. 



Figure 4.5 plots the number of elements that equal zero as a function of CPU time. 
It shows that PDN-R and PQN-R both converge to the correct number much faster 
than MU. 

On closer inspection we see that MU is actually making factor elements small, 
and is just very slow at making them exactly zero. If we choose a small positive 
threshold instead of zero, then MU might arguably do well at finding a sparse solution. 



Figure 4.6 summarizes an investigation of this idea. Three different thresholds are 
shown: 10~ 3 , ICU 4 , and 10~ 5 . The first threshold is clearly too large, declaring 
elements to be "zero" when they never converge to such a value. A threshold of ICP 4 
is also too large for R = 20, though possibly acceptable for R — 40 and R = 60. The 
choice of 10 -5 correctly identifies elements converging to zero, but PDN-R and PQN- 
R identifies them much faster. We conclude that PDN-R and PQN-R arc significantly 
better at finding a true sparse solution than MU, in terms of robustness (no need to 
choose an ad-hoc threshold) and computation time (assuming a suitable threshold for 
MU is known). 

4.4. Solving the Full Problem. In this section we move from a convex sub- 



problem to solving the full factorization (2.1 ). We generate the same 200 x 300 x 400 
tensor data as in section 4.1 and now treat all modes as optimization variables. An 
initial guess is constructed for all three modes in the same manner that A^ ' was ini- 
tialized in section [4~T| We generate 10 different tensors by changing the random seed 
used in Algorithm [5j and solve each from a single initial guess. All solvers used the 
same initial guess; however, since the full factorization is a nonconvex optimization 
problem, algorithms may converge to different solutions (each being a local minimum). 
We expect our local solutions to be reasonably close to the multilinear model 
M = [A; A^ 1 **, . . . , A^ '1 that generated the synthetic tensor data. We compared 
computed solutions with the original model using the Tensor Toolbox function score 
with option greedy. This function considers all factor matrices and weights, producing 
a number between zero (poor match) and one (exact match) J2]. Solutions computed 
with PDN-R to a tolerance of r = 10 -4 scored above 0.90. In contrast, a computed 
solution matched against the synthetic factor matrices of a different tensor scored less 
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than 0.02. These results show that an accurate factorization can yield the original 
factors for our test problems; however, our focus is on behavior of the algorithms in 
computing a solution. 

The full problem and convex row subproblem have similar convergence rates of 
the KKT violation. Table 14.41 summarizes the time taken to reach a KKT threshold 
of 10~ 3 . The PDN-R and PQN-R methods still converge to this relatively accuracy 
much faster than MU. This is not surprising, again showing the value of second-order 
derivative information. As in the subproblem, we see that PQN-R is faster relative to 



PDN-R as the number of factors R increases. Figure 4.7 shows convergence behavior 



of the full problem as Figure 4.4 showed behavior of the convex subproblem. The 
KKT error of the full problem does not reach the quadratic rate of decrease seen in 
the subproblem. This is due to nonconvexity of the full factorization problem, and 
the alternation between solutions of each mode. 



Table 4.4: Time to reach stop tolerance 10 -3 on full problem (over 10 runs). Mean 
and standard deviation are reported. Some runs of MU failed to reach the tolerance 
in 1000 iterations. 



R 


PDN-R 


PQN-R 


MU 




20 


467 ± 208 sees 


485 ± 189 sees 


912 ± 412 sees 


(0 failures) 


40 


819 ± 295 sees 


961 ± 504 sees 


1666 ±501 sees 


(2 failures) 


60 


1270 ± 506 sees 


1300 ± 515 sees 


2192 ± 357 sees 


(5 failures) 


80 


1359 ± 306 sees 


832 ± 167 sees 


2634 ± 559 sees 


(1 failure) 


100 


1860 ± 338 sees 


959 ± 237 sees 


3620 ±1018 sees 


(1 failure) 



As with the subproblem, we also observe better convergence by our methods to a 
sparse solution. Figure |4~8| shows PDN-R and PQN-R reaching the final count of zero 
elements much faster than MU. As in section |4~3} we argue that PDN-R and PQN-R 
are superior when the task is to find a solution with correct sparsity. 
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Fig. 4.7: Convergence behavior of the PDN-R (black lines) and PQN-R (green lines) 
algorithms in computing a full 3-way solution. Each algorithm was run from 10 
variants of the tensor data. 



We also ran the same three algorithms on the sparse 3-way tensor of DBLP data 
[13) examined in |16j . The data counts the number of papers published by author i\ at 
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Fig. 4.8: Effectiveness of the algorithms in finding a sparse solution for a full 3-way 
solution. In each case the total number of elements in A' ' , A*- 2 ', and A^ 3 -* equal to 
zero is plotted against execution time. The PDN-R (black lines) and PQN-R (green) 
algorithms are much faster than MU (blue). Each algorithm was run with 10 variants 
of the tensor data, so the final number of zero elements is different in each case. 



conference ii in year i^, with dimensions 7108 x 1103 x 10. The tensor contains 112,730 
nonzero elements, a density of 0.14%. The data was factorized for R between 10 and 
100 in [16] (using a least squares objective function), so we use R € {20, 60, 100} in our 
experiments. Behavior of the algorithms on the DBLP data was similar to behavior 
on our synthetic data. Figure [479] shows convergence of the count of elements equal to 
zero, for 10 runs that start from different random initial guesses. Again we see that 
PDN-R and PQN-R reach a sparse solution faster than MU. 
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Fig. 4.9: Effectiveness of the algorithms in finding a sparse solution for the DBLP 
tensor. In each case the total number of elements in A' ', A' •', and A^ 3 -* equal to 
zero is plotted against execution time. The PDN-R (black lines) and PQN-R (green) 
algorithms are much faster than MU (blue). 



Factorizations of the DBLP data computed with PDN-R and PQN-R were also 
quite sparse, making them easier to interpret. The fraction of elements exactly equal 
to zero in the computed conference factor matrix was 98.1%. The author factor 
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matrix was also very sparse, with 95.4% of the elements exactly zero. These results 
were for a factorization with R = 100, stopped after 800 seconds of execution with 
the KKT violation reduced to around 5 x 10 -4 . Figure 4.10 shows a component that 
detects related conferences that took place only in even years. The two dominant 
conferences are the same as those reported in Figure 7 of [16]. Figure 4.11 shows 
another component that groups conferences that took place only in odd years. In 
both components the sparsity is striking, especially for the conference factor. 
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Fig. 4.10: Computed factors from DBLP data for component 26 (i.e., the 26-th largest 
component by weight). The two dominant conferences, ECAI and KR, occurred only 
in even years, except for KR in 1991. 91% (6456) of elements in the author factor are 
exactly zero, as are 98% (1084) conference elements. 



5. Summary. In this paper we derived a row subproblem formulation for non- 
negative tensor factorization of the Kullback-Leibler objective that allows efficient use 
of second order information. We wrote and tested two new algorithms that exploit the 
row subproblem reformulation: PDN-R uses second derivatives in the optimization, 
while PQN-R uses a quasi-Newton approximation. We showed that both algorithms 
are much faster than the best multiplicative update method when high accuracy so- 
lutions are desired. We further showed that high accuracy is needed to identify zeros 
and compute sparse factors without resorting to the use of ad-hoc thresholds. This 
is important because sparse count data is likely to have sparsity in the factors, and 
sparse factors are always easier to interpret. 

Our Matlab algorithms work with the Tensor Toolbox [5] . We mentioned in sec- 
tion |2.1| that row subproblems can be solved in parallel, and we anticipate developing 
other versions of the algorithms for shared and distributed memory machines. 
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Fig. 4.11: Computed factors from DBLP data for component 41. The three dominant 
conferences (ICDAR, ICIAP, and CAIP) occurred only in odd years. 93% (6626) of 
elements in the author factor are exactly zero, as are 98% (1083) conference elements. 
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Appendix A. Generating Synthetic Test Data. 

The goal is to create artificial nonnegative factor matrices and from these compute 
a data tensor whose elements follow a Poisson distribution with multilinear parame- 
ters. Factorizing the data tensor should yield quantities that are close to the original 
factor matrices. The procedure is based on the work of |10) . 

The data tensor should be sparse, reflecting Poisson distributions whose probabil- 
ity of zero is not negligible. Each entry is a count of the number of samples assigned 
to this cell, out of a given total number of samples S. We generate factor matrices in 
each tensor mode and treat them as stochastic quantities to draw the S samples that 
provide data tensor counts. 

Our generation procedure utilizes the function create_problem from Tensor Tool- 
box for Matlab [5], supplying a custom function for the Factor_Generator param- 
eter (available as Matlab code from the authors). We create a multilinear model, 
M = [A; A (1) . . . A (JV) ], where A (n) € n'" xR and A G TZ R , and sizes I n and R are 
given. The model is generated by the following procedure: 

Algorithm 5 Generation of Synthetic Sparse Poisson Tensor Data 

Given tensor sizes I\, . . .In, number of components R, and number of samples S. 

Return a model M = [A; A^ 1 ' . . . A*- '], and corresponding sparse data tensor X. 

l: In each column of A^"-* , choose 20% of the elements at random and set their value 

to 1 + 10i?2;, where a; is a random value from a uniform distribution on [0, 1]. Set 

the other elements equal to the small constant 0.1. 
2: Choose random values for elements in A from a uniform distribution on [0, 1]. 
3: Rescale each column of A'™' so entries sum to 1, absorbing the scale factor into 

the corresponding element of A. 
4: Rescale the vector A so entries sum to 1. 
5: for s = 1, ... ,S do 

6: Treating A as a distribution, choose a component r at random. 
7: Treating the r-th column of A^ ' as a distribution, choose an index i\ with 

probability proportional to & r . Do the same for indices i%, . . . ,ijy, resulting in 

the index i chosen with probability 

PCi) = a {1) a {2) a iN] 

* W u i 1 r u i 2 r ■ ■ ■ u i N r- 

Increment the i-th entry of X by one. 
end for 

Rescale A <— SX so that ||A||i = S. > Recall step|3]sets ||A||i = 1. 
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Step [T] defines a strong preference for certain values of each index. As R increases, 
the relative probability of these indices is also increased so that they continue to stand 
out as strong preferences. 



Step 10 rescales A so that the t\ norms of the generated data tensor "X and 
K-tensor are the same in any mode-n unfolding. For example: 



|x (1) || 1 = |aWa(aW0... a( 2 ') 5 

h R 

z—l r— 
R 

= 5> 



=1 r=l 



r=l 



The second equality uses the fact that rows of the Khatri-Ra o pro duct sum to 1 when 
columns of A'™' sum to 1 (see the comments after equation (2.2)). 
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